{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "f37ca393",
   "metadata": {},
   "source": [
    "# Structured Profilers"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ff9bd095",
   "metadata": {},
   "source": [
    "**Data profiling** - *is the process of examining a dataset and collecting statistical or informational summaries about said dataset.*\n",
    "\n",
    "The Profiler class inside the DataProfiler is designed to generate *data profiles* via the Profiler class, which ingests either a Data class or a Pandas DataFrame. \n",
    "\n",
    "Currently, the Data class supports loading the following file formats:\n",
    "\n",
    "* Any delimited  (CSV, TSV, etc.)\n",
    "* JSON object\n",
    "* Avro\n",
    "* Parquet\n",
    "* Text files\n",
    "* Pandas Series/Dataframe\n",
    "\n",
    "Once the data is loaded, the Profiler can calculate statistics and predict the entities (via the Labeler) of every column (csv) or key-value (JSON) store as well as dataset wide information, such as the number of nulls, duplicates, etc.\n",
    "\n",
    "This example will look at specifically the structured data types for structured profiling. "
   ]
  },
  {
   "cell_type": "markdown",
   "id": "de58b9c4",
   "metadata": {},
   "source": [
    "## Reporting"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "8001185a",
   "metadata": {},
   "source": [
    "One of the primary purposes of the Profiler are to quickly identify what is in the dataset. This can be useful for analyzing a dataset prior to use or determining which columns could be useful for a given purpose.\n",
    "\n",
    "In terms of reporting, there are multiple reporting options:\n",
    "\n",
    "* **Pretty**: Floats are rounded to four decimal places, and lists are shortened.\n",
    "* **Compact**: Similar to pretty, but removes detailed statistics such as runtimes, label probabilities, index locations of null types, etc.\n",
    "* **Serializable**: Output is json serializable and not prettified\n",
    "* **Flat**: Nested Output is returned as a flattened dictionary\n",
    "\n",
    "The **Pretty** and **Compact** reports are the two most commonly used reports and includes `global_stats` and `data_stats` for the given dataset. `global_stats` contains overall properties of the data such as number of rows/columns, null ratio, duplicate ratio. `data_stats` contains specific properties and statistics for each column file such as min, max, mean, variance, etc.\n",
    "\n",
    "For structured profiles, the report looks like this:\n",
    "\n",
    "```\n",
    "\"global_stats\": {\n",
    "    \"samples_used\": int,\n",
    "    \"column_count\": int,\n",
    "    \"row_count\": int,\n",
    "    \"row_has_null_ratio\": float,\n",
    "    \"row_is_null_ratio\": float,    \n",
    "    \"unique_row_ratio\": float,\n",
    "    \"duplicate_row_count\": int,\n",
    "    \"file_type\": string,\n",
    "    \"encoding\": string,\n",
    "},\n",
    "\"data_stats\": [\n",
    "    {\n",
    "        \"column_name\": string,\n",
    "        \"data_type\": string,\n",
    "        \"data_label\": string,\n",
    "        \"categorical\": bool,\n",
    "        \"order\": string,\n",
    "        \"samples\": list(str),\n",
    "        \"statistics\": {\n",
    "            \"sample_size\": int,\n",
    "            \"null_count\": int,\n",
    "            \"null_types\": list(string),\n",
    "            \"null_types_index\": {\n",
    "                string: list(int)\n",
    "            },\n",
    "            \"data_type_representation\": [string, list(string)],\n",
    "            \"min\": [null, float],\n",
    "            \"max\": [null, float],\n",
    "            \"mean\": float,\n",
    "            \"variance\": float,\n",
    "            \"stddev\": float,\n",
    "            \"histogram\": { \n",
    "                \"bin_counts\": list(int),\n",
    "                \"bin_edges\": list(float),\n",
    "            },\n",
    "            \"quantiles\": {\n",
    "                int: float\n",
    "            }\n",
    "            \"vocab\": list(char),\n",
    "            \"avg_predictions\": dict(float), \n",
    "            \"data_label_representation\": dict(float),\n",
    "            \"categories\": list(str),\n",
    "            \"unique_count\": int,\n",
    "            \"unique_ratio\": float,\n",
    "            \"precision\": {\n",
    "                'min': int,\n",
    "                'max': int,\n",
    "                'mean': float,\n",
    "                'var': float,\n",
    "                'std': float,\n",
    "                'sample_size': int,\n",
    "                'margin_of_error': float,\n",
    "                'confidence_level': float\t\t\n",
    "            },\n",
    "            \"times\": dict(float),\n",
    "            \"format\": string\n",
    "        }\n",
    "    }\n",
    "]\n",
    "```\n",
    "\n",
    "In the example, the `compact` format of the report is used to shorten the full list of the results. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "5fcb5447",
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "import sys\n",
    "import json\n",
    "\n",
    "try:\n",
    "    sys.path.insert(0, '..')\n",
    "    import dataprofiler as dp\n",
    "except ImportError:\n",
    "    import dataprofiler as dp\n",
    "\n",
    "data_path = \"../dataprofiler/tests/data\"\n",
    "\n",
    "# remove extra tf loggin\n",
    "import tensorflow as tf\n",
    "tf.compat.v1.logging.set_verbosity(tf.compat.v1.logging.ERROR)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "a7fc2df6",
   "metadata": {
    "scrolled": true,
    "tags": []
   },
   "outputs": [],
   "source": [
    "data = dp.Data(os.path.join(data_path, \"csv/aws_honeypot_marx_geo.csv\"))\n",
    "profile = dp.Profiler(data)\n",
    "\n",
    "# Compact - A high level view, good for quick reviews\n",
    "report  = profile.report(report_options={\"output_format\":\"compact\"})\n",
    "print(json.dumps(report, indent=4))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d7ec39d2",
   "metadata": {},
   "source": [
    "It should be noted, in addition to reading the input data from multiple file types, DataProfiler allows the input data as a dataframe. To get more results related to detailed predictions at the entity level from the DataLabeler component or histogram results, the format `pretty` should be used. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "29737f25",
   "metadata": {
    "scrolled": true,
    "tags": []
   },
   "outputs": [],
   "source": [
    "# run data profiler and get the report\n",
    "import pandas as pd\n",
    "my_dataframe = pd.DataFrame([[1, 2.0],[1, 2.2],[-1, 3]], columns=[\"col_int\", \"col_float\"])\n",
    "profile = dp.Profiler(my_dataframe)\n",
    "\n",
    "report  = profile.report(report_options={\"output_format\":\"pretty\"})\n",
    "print(json.dumps(report, indent=4))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "241f6e3e",
   "metadata": {},
   "source": [
    "## Profiler Type"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "5b20879b",
   "metadata": {},
   "source": [
    "The profiler will infer what type of statistics to generate (structured or unstructured) based on the input. However, you can explicitly specify profile type as well. Here is an example of the the profiler explicitly calling the structured profile."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "dc44eb47",
   "metadata": {
    "scrolled": true,
    "tags": []
   },
   "outputs": [],
   "source": [
    "data = dp.Data(os.path.join(data_path, \"csv/aws_honeypot_marx_geo.csv\"))\n",
    "profile = dp.Profiler(data, profiler_type='structured')\n",
    "\n",
    "# print the report using json to prettify.\n",
    "report = profile.report(report_options={\"output_format\": \"pretty\"})\n",
    "print(json.dumps(report, indent=4))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "fe02ad64",
   "metadata": {},
   "source": [
    "## Profiler options"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "40804cc9",
   "metadata": {},
   "source": [
    "The DataProfiler has the ability to turn on and off components as needed. This is accomplished via the `ProfilerOptions` class.\n",
    "\n",
    "For example, if a user doesn't require histogram information they may desire to turn off the histogram functionality. Simialrly, if a user is looking for a more accurate labeling, they can increase the samples used to label.\n",
    "\n",
    "Below, let's remove the histogram and increase the number of samples to the labeler component (1,000 samples). \n",
    "\n",
    "Full list of options in the Profiler section of the [DataProfiler documentation](https://capitalone.github.io/DataProfiler/profile_options.html)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "9d25d899",
   "metadata": {
    "scrolled": true,
    "tags": []
   },
   "outputs": [],
   "source": [
    "data = dp.Data(os.path.join(data_path, \"csv/diamonds.csv\"))\n",
    "\n",
    "profile_options = dp.ProfilerOptions()\n",
    "\n",
    "# Setting multiple options via set\n",
    "profile_options.set({ \"histogram.is_enabled\": False, \"int.is_enabled\": False})\n",
    "\n",
    "# Set options via directly setting them\n",
    "profile_options.structured_options.data_labeler.max_sample_size = 1000\n",
    "\n",
    "profile = dp.Profiler(data, options=profile_options)\n",
    "report  = profile.report(report_options={\"output_format\":\"compact\"})\n",
    "\n",
    "# Print the report\n",
    "print(json.dumps(report, indent=4))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "2052415a",
   "metadata": {},
   "source": [
    "## Updating Profiles"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "7e02f746",
   "metadata": {},
   "source": [
    "Beyond just profiling, one of the unique aspects of the DataProfiler is the ability to update the profiles. To update appropriately, the schema (columns / keys) must match appropriately."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "7ab8022f",
   "metadata": {
    "scrolled": true,
    "tags": []
   },
   "outputs": [],
   "source": [
    "# Load and profile a CSV file\n",
    "data = dp.Data(os.path.join(data_path, \"csv/sparse-first-and-last-column-header-and-author.txt\"))\n",
    "profile = dp.Profiler(data)\n",
    "\n",
    "# Update the profile with new data:\n",
    "new_data = dp.Data(os.path.join(data_path, \"csv/sparse-first-and-last-column-skip-header.txt\"))\n",
    "# new_data = dp.Data(os.path.join(data_path, \"iris-utf-16.csv\")) # will error due to schema mismatch\n",
    "profile.update_profile(new_data)\n",
    "\n",
    "# Take a peek at the data\n",
    "print(data.data)\n",
    "print(new_data.data)\n",
    "\n",
    "# Report the compact version of the profile\n",
    "report  = profile.report(report_options={\"output_format\":\"compact\"})\n",
    "print(json.dumps(report, indent=4))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "66ec6dc5",
   "metadata": {},
   "source": [
    "## Merging Profiles"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e2265fe9",
   "metadata": {},
   "source": [
    "Merging profiles are an alternative method for updating profiles. Particularly, multiple profiles can be generated seperately, then added together with a simple `+` command: `profile3 = profile1 + profile2`"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cc68ca07",
   "metadata": {
    "scrolled": true,
    "tags": []
   },
   "outputs": [],
   "source": [
    "# Load a CSV file with a schema\n",
    "data1 = dp.Data(os.path.join(data_path, \"csv/sparse-first-and-last-column-header-and-author.txt\"))\n",
    "profile1 = dp.Profiler(data1)\n",
    "\n",
    "# Load another CSV file with the same schema\n",
    "data2 = dp.Data(os.path.join(data_path, \"csv/sparse-first-and-last-column-skip-header.txt\"))\n",
    "profile2 = dp.Profiler(data2)\n",
    "\n",
    "# Merge the profiles\n",
    "profile3 = profile1 + profile2\n",
    "\n",
    "# Report the compact version of the profile\n",
    "report  = profile3.report(report_options={\"output_format\":\"compact\"})\n",
    "print(json.dumps(report, indent=4))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "7ea07dc6",
   "metadata": {},
   "source": [
    "As you can see, the `update_profile` function and the `+` operator function similarly. The reason the `+` operator is important is that it's possible to *save and load profiles*, which we cover next."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "375ff25c-b189-436a-b07d-5e7f13cc6e03",
   "metadata": {},
   "source": [
    "## Differences in Data\n",
    "Can be applied to both structured and unstructured datasets. \n",
    "\n",
    "Such reports can provide details on the differences between training and validation data like in this pseudo example:\n",
    "```python\n",
    "profiler_training = dp.Profiler(training_data)\n",
    "profiler_testing = dp.Profiler(testing_data)\n",
    "\n",
    "validation_report = profiler_training.diff(profiler_testing)\n",
    "```"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "65360a03-e3ff-4f3c-9963-412298fdb284",
   "metadata": {
    "scrolled": true,
    "tags": []
   },
   "outputs": [],
   "source": [
    "from pprint import pprint\n",
    "\n",
    "# structured differences example\n",
    "data_split_differences = profile1.diff(profile2)\n",
    "pprint(data_split_differences)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "2ae471ff-852f-400a-9bee-5c9fef96f10a",
   "metadata": {},
   "source": [
    "## Graphing a Profile\n",
    "\n",
    "We've also added the ability to generating visual reports from a profile.\n",
    "\n",
    "The following plots are currently available to work directly with your profilers:\n",
    "\n",
    "  * missing values matrix\n",
    "  * histogram (numeric columns only)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "734b588d-ac9a-409c-8eb5-b1a0aede8c63",
   "metadata": {},
   "outputs": [],
   "source": [
    "import matplotlib.pyplot as plt\n",
    "\n",
    "\n",
    "# get the data\n",
    "data_folder = \"../dataprofiler/tests/data\"\n",
    "data = dp.Data(os.path.join(data_folder, \"csv/aws_honeypot_marx_geo.csv\"))\n",
    "\n",
    "# profile the data\n",
    "profile = dp.Profiler(data)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e4e70204-fa30-43c2-9556-e84c19f82d32",
   "metadata": {},
   "outputs": [],
   "source": [
    "# generate a missing values matrix\n",
    "fig = plt.figure(figsize=(8, 6), dpi=100)\n",
    "fig = dp.graphs.plot_missing_values_matrix(profile, ax=fig.gca(), title=\"Missing Values Matrix\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "d734d355-e542-4245-a1e9-66521e333c2d",
   "metadata": {},
   "outputs": [],
   "source": [
    "# generate histogram of all int/float columns\n",
    "fig = dp.graphs.plot_histograms(profile)\n",
    "fig.set_size_inches(8, 6)\n",
    "fig.set_dpi(100)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "30868000",
   "metadata": {},
   "source": [
    "## Saving and Loading a Profile"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f2858072",
   "metadata": {},
   "source": [
    "Not only can the Profiler create and update profiles, it's also possible to save, load then manipulate profiles."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "2ad9ca57",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Load data\n",
    "data = dp.Data(os.path.join(data_path, \"csv/names-col.txt\"))\n",
    "\n",
    "# Generate a profile\n",
    "profile = dp.Profiler(data)\n",
    "\n",
    "# Save a profile to disk for later (saves as pickle file)\n",
    "profile.save(filepath=\"my_profile.pkl\")\n",
    "\n",
    "# Load a profile from disk\n",
    "loaded_profile = dp.Profiler.load(\"my_profile.pkl\")\n",
    "\n",
    "# Report the compact version of the profile\n",
    "report = profile.report(report_options={\"output_format\":\"compact\"})\n",
    "print(json.dumps(report, indent=4))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "8f9859c2",
   "metadata": {},
   "source": [
    "With the ability to save and load profiles, profiles can be generated via multiple machines then merged. Further, profiles can be stored and later used in applications such as change point detection, synthetic data generation, and more. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "3571f2d0",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Load a multiple files via the Data class\n",
    "filenames = [\"csv/sparse-first-and-last-column-header-and-author.txt\",\n",
    "             \"csv/sparse-first-and-last-column-skip-header.txt\"]\n",
    "data_objects = []\n",
    "for filename in filenames:\n",
    "    data_objects.append(dp.Data(os.path.join(data_path, filename)))\n",
    "\n",
    "\n",
    "# Generate and save profiles\n",
    "for i in range(len(data_objects)):\n",
    "    profile = dp.Profiler(data_objects[i])\n",
    "    profile.save(filepath=\"data-\"+str(i)+\".pkl\")\n",
    "\n",
    "\n",
    "# Load profiles and add them together\n",
    "profile = None\n",
    "for i in range(len(data_objects)):\n",
    "    if profile is None:\n",
    "        profile = dp.Profiler.load(\"data-\"+str(i)+\".pkl\")\n",
    "    else:\n",
    "        profile += dp.Profiler.load(\"data-\"+str(i)+\".pkl\")\n",
    "\n",
    "\n",
    "# Report the compact version of the profile\n",
    "report = profile.report(report_options={\"output_format\":\"compact\"})\n",
    "print(json.dumps(report, indent=4))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "4690068a-8fc3-4bd5-8649-63d0f34fa91d",
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.11"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
